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Abstract 

Gene duplication is a widespread phenomenon in genome evolution, and it has been proposed to serve as an engine of 
evolutionary innovation. In the present study, we performed the first comprehensive analysis of duplicate genes in the 
bovine genome. A total of 3131 putative duplicated gene pairs were identified, including 712 cattle-specific duplicate gene 
pairs unevenly distributed across the genome, which are significantly enriched for specific biological functions including 
immunity, growth, digestion, reproduction, embryonic development, inflammatory response, and defense response to 
bacterium. Around 97.1 % (87.8%) of (cattle-specific) duplicate gene pairs were found to have distinct exon-intron structures. 
Analysis of gene expression by RNA-Seq and sequence divergence (synonymous or non-synonymous) revealed that 
expression divergence is correlated with sequence divergence, as has been previously observed in other species. This 
analysis also led to the identification of a subset of cattle-specific duplicate gene pairs exhibiting very high expression 
divergence. Interestingly, further investigation revealed a significant relationship between structural and expression 
divergence while controlling for the effect of synonymous sequence divergence. Together these results provide further 
insight into duplicate gene sequence and expression divergence in cattle, and their potential contributions to phenotypic 
divergence. 
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Introduction 

Gene duplication is thought to be a major driving force of 
evolution as it provides raw materials that selection can act upon 
[1-3]. Duplicate genes, the products of gene duplication, initially 
have identical sequences and functions but tend to diverge in 
sequence and expression patterns later [4,5]. The redundancy 
conferred by duplicate genes may be important for organismal 
adaptation to different environments [6] . Several previous studies 
have investigated divergence between duplicate genes at the 
genome scale [7-16]. In Arabidopsis , over 95% of duplicate genes 
studied have diverged in exon-intron structure, and structural 
divergence occurs largely proportionally to evolutionary time [7] . 
Gu et al examined expression divergence between 400 duplicate 
gene pairs in yeast using microarray data and found a positive 
correlation between synonymous sequence divergence (a proxy of 
evolutionary time) and expression divergence [10]. A similar 
conclusion was reached in an analysis of human duplicate genes 
[13]. However, a later study in Arabidopsis found that synonymous 
sequence divergence and expression divergence were not corre- 
lated [14]. 



Duplicate genes are common in eukaryotic genomes and may 
be responsible for species-specific gene functions, which in turn 
might facilitate species-specific adaptation [17—22]. In farm 
animals, there is economically important variation in a diverse 
range of phenotypes related to, for example, reproduction and 
body structure [23,24]. In cattle, both natural and artificial 
selection over a relatively short period of time have resulted in a 
broad variety of phenotypic and genetically diverse breeds [25- 
27]. Segmental duplications are widespread in the bovine genome 
and it has been observed that genes overlapping with segmental 
duplications are significandy enriched for biological functions such 
as immunity, digestion, lactation and reproduction [26,28]. 
However, because of the way segmental duplication is defined, 
duplicate genes within segmental duplications have high levels of 
sequence identity [28]. Hence, only duplicate gene pairs with high 
sequence identity (median = 98.9%) have been investigated 
previously in cattle. In addition, the relationship between sequence 
divergence and expression divergence has not been investigated in 
this species, nor the relationship between gene structure 
divergence and expression divergence. Investigation of these 
relationships may provide further insight into the evolution of 
genes arising through duplication. 
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The purpose of the present study is to identify duplicate genes in 
cattie and to characterize their sequence, gene structure and 
expression divergence. 

We found 3131 putative duplicated gene pairs across the 
genome, including 7 1 2 cattle-specific duplicate gene pairs. Catde- 
specific duplicate genes are significantly enriched for biological 
functions such as immunity, growth, digestion, reproduction, 
embryonic development, inflammatory response, and defense 
response to bacterium. 3035 (625 cattle-specific) duplicate gene 
pairs were found to have distinct exon-intron structure, and 
further investigation showed that exon-intron structure divergence 
occurred quickly at the early stages of duplicate gene evolution. 
Expression divergence analysis revealed a positive correlation 
between synonymous sequence divergence and expression diver- 
gence, as expected, and led to the identification of a subset of cattle 
genes exhibiting high expression divergence. Finally, we investi- 
gated the relationship between structural and expression diver- 
gence and found that expression divergence was on average 
greater for duplicate gene pairs exhibiting exon-intron structure 
divergence, regardless of their overall level of synonymous 
sequence divergence, suggesting structural divergence may be 
partially responsible for the divergent expression pattern observed. 

Materials and Methods 

Identification of duplicated gene pairs 

All bovine protein sequences were downloaded from Ensembl 
database release 67 [29]. To identify duplicate gene pairs, a 
previously described approach was followed [30]. BLASTP [31] 
was used to compare every protein against all other proteins. 
Reciprocal best BLAST hits were classified as duplicates if (1) the 
length of the BLASTP aligned region was &80% of the longer 
protein, and (2) the identity between them was / > 30% if the 
aligned region is longer than 150 amino acids and 
7>0.01 x6 + 4.8L- 032 l 1 - t,,c ' ,( -' L / 1000) ] for all the other protein 
pairs. Pseudogenes were identified based on Ensembl annotation 
and discarded. Annotations from the Ensembl database were also 
used to further classify duplicates as non-cattle-specific duplicates 
(i.e. also observed in other species in the Ensembl database) or 
catde-specific duplicates (i.e. observed only in the bovine genome). 

Synonymous and nonsynonymous sequence divergence 

Protein sequences for each duplicated gene pair were aligned 
using ClustalW [32]. The protein alignments were then used as a 
guide to align the coding sequences (from Ensembl release 67) 
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Figure 1. Distribution of sequence identity between duplicate 
gene pairs. 

doi:10.1371/journal.pone.0102868.g001 



using a custom Perl script. The synonymous (ds) and nonsynon- 
ymous (dff) sequence divergence were calculated for each pair of 
aligned coding sequences, using the maximum likelihood method 
implemented in the Phylogenetic Analysis Using Maximum 
Likelihood (PAML) package (version 4) [33] . 

Exon-intron structure difference between duplicate gene 
pairs 

A previously described method was followed to determine 
whether duplicate genes have diverged in exon-intron structure 
[7]. Duplicate genes were regarded as structurally divergent if they 
had a different number of exons (termed Class 1 difference) or if 
they had the same number of exons but the lengths of at least one 
pair of homologous exons were different (termed Class 2 
difference). The number of exons and the length of exons were 
retrieved from Ensembl database release 67. 

Sample collection, library construction, sequencing and 
expression profiling 

To measure gene expression, mRNA from seven different 
tissues was extracted (adipose, muscle, hypothalamus, duodenum, 
liver, lung and kidney) from frozen tissues using TRIzol 
(Invitrogen). The original samples were collected from beef catde 
at the Lacombe research station in Alberta (Canada), following the 
guidelines of the Canadian Council on Animal Care (1993), and 
the protocol approved by the Lacombe Research Centre Animal 
Care Committee. Messenger RNA from 7 — 1 4 animals was pooled 
equally before sequencing (Table SI). Sequencing libraries were 
constructed from each pooled tissue according to a standard 
protocol (mRNA Sequencing Sample Preparation Guide, Illu- 
mina, USA). 

Sequencing was performed on the Illumina Genome Analyzer 
II following the manufacturer's recommendations. Raw reads that 
failed the chastity filter and reads with average quality less than 20 
were removed. The remaining reads were aligned to the UMD3 
bovine genome assembly [34] using TopHat vl.4.0 [35]. Cufflinks 
vl.3.0 [36] was then used to quantify the expression of each 
transcript in each tissue. Raw sequence data are available in the 
ArrayExpress database (www.ebi.ac.uk/ arrayexpress) under ac- 
cession number E-MTAB-2596. 

Expression profile similarity between duplicate genes 

Following a previous study [37], the FPKM values were log- 
transformed using log2 (FPKM+offset) with an offset =1.0. To 
measure the similarity in expression profile between duplicate 
genes, Pearson's correlation coefficient r of expression level across 
all seven tissues between each duplicate pair was calculated. A 
high r indicates a high similarity in expression profile between 
duplicate genes. Furthermore, to study the relationship between ds 
(or dff) and the correlation coefficient r, we analyzed only the gene 
pairs in which both genes were expressed in at least one tissue. A 
gene was treated as not expressed if FPKM was < 1 and classified 
as expressed if FPKM >2. The Pearson's correlation coefficient r 
was transformed into ln[(l +r)/(l — r)]. As shown in Figure SI, 
the transformation serves to make the scale more appropriate for 
linear regression analysis [10,13]. The linear regression was 
carried out between ds (or dff) and the transformed r. 
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Figure 2. The relationship between proportion of duplicate 
gene pairs with different exon-intron structure and sequence 
divergence. Synonymous divergence is used to represent sequence 
divergence. Each point represents 200 gene pairs. 
doi:1 0.1 371 /journal.pone.01 02868.g002 

Results 

Distribution of duplicate gene pairs in the bovine 
genome 

A total of 3131 putative duplicate gene pairs were identified, 
including 712 cattle-specific duplicate gene pairs (Table S2). The 
sequence identities between duplicate gene pairs vary from 30% to 
100% (Figure 1). Not surprisingly, the distribution of identities 
between catde-specific duplicate gene pairs is gready shifted to the 
right when compared with the distribution of all duplicate genes 
(P=1.74x 10 - 303 , Figure 1). Of 3131 duplicate gene pairs, 2165 
(69%) are interchromosomal duplications (Table 1). However, 
only 215 (30%) cattle-specific duplicate gene pairs are interchro- 
mosomal duplications, presumably reflecting the fact that dupli- 
cates tend to arise by intrachromosomal duplications but that over 
time additional changes to the genome can further separate 
duplicates. Consistent with this, compared with interchromosomal 
duplications, intrachromosomal duplication gene pairs show 
obvious higher identities (P = 1.19 x 10~ 76 , Figure 1), as was also 
noted in a catde study of segmental duplications [28]. 712 cattle- 
specific duplicate gene pairs are distributed in a nonrandom 
fashion in the genome. Duplicate content varies significantly 
among different chromosomes (Table 1). Chromosomes 15, 29, 
and X show the greatest enrichment for duplicate genes with more 
than twofold the duplicate content of the genome average. 
Interestingly, Chromosomes 15, 29 and X all have significandy 



more intrachromosomal duplications than interchromosomal 
duplications (Table 1). 

Functional roles of cattle-specific duplicate genes 

Consistent with similar duplication studies in the other 
mammals and a previous cattle segmental duplication study 
[28,38-41], we also observed many duplicate genes that are 
important for drug detoxification, immunity and receptor and 
signal recognition (such as cytochrome P450, ribonuclease A and 
beta defensins). In order to test whether any particular function is 
overrepresented in catde-specific duplicate genes, we performed 
singular enrichment analysis using agriGO [42]. Statistically 
significant over representations were observed for multiple 
biological processes (Table S3), such as 'response to stimulus' 
(P = 4.5 x 10~ 201 ), 'response to chemical stimulus' 
(P = 2.0x 10- 150 ), 'reproduction' (P = 9.0 x lO" 47 ), 'defense re- 
sponse' (P = 8.5 x 10~ 44 ), 'immune system process' 
(P = 1.7 x 10~ 40 ), 'immune response' (P= 1.7 X 10 -36 ), 'embry- 
onic development' (P= 3.3 x 10~ 32 ), 'inflammatory response' 
(P= 3.3 x 10~ 14 ), 'response to wounding' (P = 5.3 x 10~ n ), 'in- 
nate immune response' (P= 1.7 x 10~ 12 ), 'defense response to 
bacterium' (P= 1.5 x 10~ 9 ), and 'digestion' (P = 2.3x 10~ 7 ). We 
also observed over representations for several cell components, 
such as 'plasma membrane' (P = 9.3 x 10~ 124 ), 'MHC protein 
complex' (P = 6.8 x 10~ 18 ) and 'MHC class I protein complex' 
(P=3.1 x 10- 15 ). 

Extensive changes in exon-intron structure between 
duplicate gene pairs 

Comparisons of 3131 duplicate gene pairs showed that 1827 
pairs (58.4%) had a different number of exons (Class 1). In 1198 
other cases (38.2%), the number of exons remained the same, 
whereas the lengths of one or more homologous exons were 
different (Class 2). Our approach of comparing exon numbers and 
lengths identified 3025 (96.6%) duplicate genes with obvious 
differences in gene structure. Slightly fewer catde-specific duplicate 
gene pairs (87.8% of cattle-specific duplicates, 383 pairs for Class 1 
and 242 pairs for Class 2) have different exon-intron structures. 
We next examined whether the proportion of duplicate gene pairs 
with different exon-intron structure is correlated with evolutionary 
time. A significant positive correlation is observed between the 
proportions of gene pairs with diverged exon-intron structure and 
ds (P = 0.01, r = 0.95, Figure 2), as was also noted in a previous 
study in Arabidopsis [7] . 



Table 2. Percentage of differentially expressed duplicate gene pairs by tissue. 



Tissues 


Percentage of all duplicate gene pairs 


Percentage of cattle-specific duplicate gene pairs 


adipose 


20.66 


13.32 


liver 


20.60 


11.23 


duodenum 


21.49 


11.37 


hypothalamus 


17.15 


11.51 


lung 


21.88 


13.34 


muscle 


22.52 


15.59 


kidney 


20.72 


11.80 
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Figure 3. The relationship between Pearson's correlation coefficient of gene expression and sequence divergence. (A) A significant 
negative correlation implies a positive correlation between d s and expression divergence because 1 — r can be regarded as expression divergence. 
Each point represents five gene pairs. (B) A negative correlation between transformed r and d N with <0.2. Each point represents five gene pairs. 
(C) No correlation between transformed r and d N with dpj>0.2. Each point represents five gene pairs. 
doi:1 0.1 371 /journal.pone.01 02868.g003 



Expression divergence of duplicate gene pairs 

To measure divergence in expression between duplicate genes, 
we performed transcriptome sequencing for the following tissues: 
adipose, muscle, hypothalamus, duodenum, liver, lung and kidney 
(library details are given in Table SI and expression values are 
given in Table S4). We first calculated the proportion of duplicate 
genes with divergent expression. Two duplicate genes were treated 
as having diverged expression in a particular tissue if one gene is 
expressed in that tissue whereas the other is not. The expression 
data shows that 17-23% of duplicate gene pairs have divergence 
in one of the seven tissues studied (Table 2). In total, 59.8% of 
duplicate gene pairs have diverged in expression in at least one 
tissue and 38.6% of duplicate gene pairs have diverged in at least 
two tissues. Another way of measuring divergence in expression 
between duplicate genes is to compute Pearson's correlation 



coefficient r. A significant negative correlation is observed between 
transformed r and d s (P = 6.6x 10~ 7 , Figure 3A). In addition, a 
weaker negative correlation r is observed between d N and 
transformed r when only gene pairs with d^<Q.2 are examined 
(P = 2.4x 10~ 6 , Figure 3B). With d N >Q.2, the correlation is no 
longer statistically significant (P = 0.91, Figure 3C). Here, we 
analyzed the gene pairs in which both genes were expressed in at 
least one tissue. We observed the same significant relationships 
between expression divergence and sequence divergence when the 
analysis was restricted to pairs of genes that were expressed in 
three or more tissues (Figure S2). These findings regarding 
divergence in the spatial pattern are consistent with previous 
studies in yeast and human [10,13]. 

Among the cattle-specific gene duplicates, which in general have 
diverged much more recently than the duplicates as a whole, 




Figure 4. The relationship between Pearson's correlation coefficient of gene expression and structural divergence. (A) Four different 
types of duplicate gene pairs: not structurally divergent (NSD), structurally divergent (SD), not structurally divergent cattle-specific (NSD cattle- 
specific), structurally divergent cattle-specific (SD cattle-specific). (B) The linear regression model with expression similarity as the dependent variable 
and the synonymous divergence as explanatory variable was constructed for both structurally divergent duplicate gene pairs and not structurally 
divergent duplicate gene pairs. 
doi:1 0.1 371 /journal.pone.01 02868.g004 
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expression divergence is also observed. 11-13% of catde-specific 
duplicates showed evidence of divergence of transcriptional levels 
between the duplicates (Table 2). We next focused on catde- 
specific duplicate gene pairs showing dramatic changes in 
expression, perhaps signaling important functional divergence. 
We examined catde-specific duplicate gene pairs with a Pearson's 
correlation coefficient r<0.5. There were 31 such duplicate gene 
pairs in this group (Table S5). Interestingly, most of the gene pairs 
are structural divergent (30 of 31). These genes, which are 
enriched for certain functional activities, such as 'helicase activity' 
(P = 6.8x 10- 6 ), 'hydrolase activity' (P = 0.00018), 'catalytic 
activity' (P = 0.00058), and 'transferase activity' (P = 0.003), may 
have been targets of positive selection, perhaps through their 
contributions to important cattle adaptations. 

Relationship between structural divergence and 
expression divergence 

We compared the expression divergence of duplicate gene pairs 
with conserved gene structure to those with different exon-intron 
structure (Figure 4A). We observed that the expression divergence 
between structurally divergent duplicate gene pairs was signifi- 
candy higher than the divergence between duplicate gene pairs 
with the same exon-intron structure, in comparisons involving all 
duplicate gene pairs (P = 0.007) and in comparisons of cattie- 
specific duplicate gene pairs (P = 0.0002). This result might be due 
to the fact that both structural and expression divergence are 
related with evolutionary time (synonymous divergence d s ). We 
wondered if gene structure changes themselves account for some 
of the expression divergence, perhaps through altering, for 
example, transcription or splicing efficiency. We constructed a 
linear regression model with expression similarity as the dependent 
variable and the synonymous divergence as explanatory variable 
for both structurally divergent duplicate gene pairs and non- 
structurally divergent duplicate gene pairs. We found that 
expression divergence was on average greater for duplicate gene 
pairs exhibiting exon-intron structure divergence (P = 0.0007), 
regardless of their overall level of synonymous sequence 
divergence (Figure 4B). Further analysis of covariance showed 
that a significant relationship exists between structural and 
expression divergences while controlling for the effect of synon- 
ymous sequence divergence (P = 0.0068). Restricting the analysis 
to genes expressed in three or more tissues did not change the 
findings (Figure S3). 

Discussion 

Cattle-specific duplicate gene pairs 

Although gene duplications have been studied in the context of 
segmental duplications in catde, we sought to more broadly 
characterize the repertoire and function of duplicate genes in this 
species using less stringent identity thresholds for discovery 
together with RNA-Seq for characterizing expression. Consistent 
with a previous cattle segmental duplication study [28], we 
observed many genes involved in ruminant or cattle specific 
aspects of reproduction including pregnancy-associated glycopro- 
tein, interferon alpha and beta, trophoblast Kunitz domain 
proteins and prolactin-related proteins. These genes are related 
with fetal growth, maternal adaptations to pregnancy, and the 
coordination of parturition. We also found considerable gene 
duplications involved in adaptive immune responses in cattle. 
Duplication of genes involved in immune response or response to 
other organisms (e.g. bacterium) may be particularly important to 
catde due to the substantial load of microorganisms present in the 
rumen [28]. We found that most duplicate gene pairs in cattle, 



including those classified as cattle-specific, exhibit structural 
divergence. Expression analysis led to the identification of a subset 
of catde-specific duplicate gene pairs exhibiting high expression 
divergence. Because our analyses are based on currendy available 
and undoubtedly imperfect genome assemblies and gene annota- 
tions, we expect that some of the duplicate pairs we identified and 
characterized are not bona fide gene pairs. Nonetheless, our results 
regarding cattle-specific duplicate genes should provide insight 
into the evolution of duplicate genes both at the sequence and 
expression levels. 

Relationship between structural divergence and 
expression divergence 

Duplicate genes initially have similar sequences and functions 
but tend to divergence in regulatory and coding regions. It has 
been shown that the protein sequence divergence is positively 
correlated with expression divergence in Drosophila [43]. How- 
ever, the relationship between exon-intron structure and expres- 
sion divergence is not well understood. In a previous study of 
duplicate genes in Arabidopsis thaliana [8], structural divergence 
were found to be positively correlated with expression divergence. 
However, synonymous divergence was not taken into account in 
this previous study. In the present study, we found a significant 
relationship between structural and expression divergence while 
controlling the effect of synonymous sequence divergence. We can 
think of several explanations for this relationship. For example, 
positive selection could sometimes favor both gene structure and 
expression changes in the same gene because both might be 
contribute to a new beneficial function; genes subjected to relaxed 
negative selection may tend to accumulate both gene structure- 
altering and expression-altering mutations; and changes in gene 
structure might direcdy alter expression, though effects on 
transcription or transcripts. 

Conclusion 

In this study, we identified duplicate genes in the bovine 
genome and further classified them as cattle-specific and non- 
catde-specific. We found that structural divergence is common 
between duplicate genes and increases with evolutionary time, as 
expected. Using RNA-Seq, we investigated the relationship 
between gene expression and other characteristics and found a 
positive correlation between expression divergence and synony- 
mous sequence divergence, as well as a significant relationship 
between structural and expression divergence while controlling the 
effect of evolutionary time. Our findings not only further support 
previously observed relationships observed in other species, but 
also describe a set of cattle-specific duplicate gene pairs, some of 
which may contribute to adaptations in this species. 

Supporting Information 

Figure SI Histograms of Pearson's correlation coeffi- 
cient r and transformed r. (A) Pearson's correlation coefficient 
r. (B) The transformation (l+r)/(l-r). (C) The log transformation 
of (l+r)/(l-r). 
(TIFF) 

Figure S2 The relationship between Pearson's correla- 
tion coefficient of gene expression and sequence diver- 
gence when the analysis was restricted to pairs of genes 
that were expressed in three or more tissues. 

(TIFF) 
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Figure S3 The relationship between Pearson's correla- 
tion coefficient of gene expression and structural 
divergence when the analysis was restricted to pairs of 
genes that were expressed in three or more tissues. 

(TIFF) 

Table SI Details of sequencing libraries. 
(XLSX) 

Table S2 Complete list of duplicated gene pairs. 
(XLSX) 

Table S3 GO enrichment analysis of Catde-specific duplicate 

genes. 

(XLSX) 



Table S4 Expression values in the seven tissues studied. 
(XLSX) 

Table S5 Catde-specific duplicate genes that rapidly diverged in 

expression. 

(XLSX) 
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